Tika File Adapter

This adapter provides functions to parse files stored in HDFS in various formats using the Apache Tika library. It is described in the following topics:

Built-in Library Functions for Parsing Files with Tika

To use the built-in functions in your query, you must import the Tika file module as follows:

import module "oxh:tika";

The Tika file module contains the following functions:

  • tika:collection

  • tika:parse

For examples, see "Examples of Tika File Adapter Functions."

tika:collection

Parses files stored in HDFS in various formats and extracts the content or metadata from them.

Signature

declare %tika:collection function
   tika:collection($uris as xs:string*) as document-node()* external;

declare %tika:collection function
   tika:collection($uris as xs:string*, $contentType as xs:string?) as document-node()* external;

Parameters

$uris: The HDFS file URIs.

$contentType: Specifies the media type of the content to parse and may include a charset attribute. When this parameter is specified, it defines both the content type and the encoding. When it is omitted, Tika attempts to detect these values automatically, for example from the file extension. Oracle recommends that you specify this parameter.

Returns

Returns a document node for each value. See "Tika Parser Output Format".
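
The following is a minimal sketch of a query that calls tika:collection with an explicit content type. The HDFS path pattern is illustrative, and the sketch assumes the OXH text file adapter (the oxh:text module and its text:put function) is available for writing the output.

import module "oxh:tika";
import module "oxh:text";

(: Parse PDF files with an explicit content type instead of relying on
   auto-detection, then write the plain-text body of each document. :)
for $doc in tika:collection("/user/oxh/docs/*.pdf", "application/pdf")
return
   text:put(string($doc//*:body[1]))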

tika:parse

Parses the data passed to it as an argument. For example, it can parse an HTML fragment embedded in an XML or JSON document.

Signature

declare function
   tika:parse($data as xs:string?, $contentType as xs:string?) as document-node()* external;

Parameters

$data: The value to be parsed.

$contentType: Specifies the media type of the content to parse and may include a charset attribute. When this parameter is specified, it defines both the content type and the encoding. When it is omitted, Tika attempts to detect these values automatically, for example from the file extension. Oracle recommends that you specify this parameter.

Returns

Returns a document node for each value. See "Tika Parser Output Format".
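
As a sketch, the following query parses an inline HTML fragment with tika:parse. The fragment itself and the use of text:put (from the oxh:text module) for output are illustrative assumptions.

import module "oxh:tika";
import module "oxh:text";

(: Parse an HTML fragment held in a string; the charset attribute of the
   content type defines the encoding. :)
let $fragment := "<p>Hello <b>Tika</b>!</p>"
for $doc in tika:parse($fragment, "text/html; charset=UTF-8")
return
   text:put(string($doc//*:body[1]))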

Custom Functions for Parsing Files with Tika

You can use the following annotations to define functions to parse files in HDFS with Tika. These annotations provide additional functionality that is not available using the built-in functions.

Signature

Custom functions for reading HDFS files must have one of the following signatures:

declare %tika:collection [additional annotations]
   function local:myFunctionName($uris as xs:string*, $contentType as xs:string?) as document-node()* external;
declare %tika:collection [additional annotations]
   function local:myFunctionName($uris as xs:string*) as document-node()* external;

Annotations

%tika:collection(["method"])

Identifies an external function to be implemented by the Tika file adapter. Required.

The optional method parameter can be one of the following values:

  • tika: Each file is parsed and returned as a document-node(). Default.

%output:media-type

Declares the file content type. The value is a MIME type and, per the XQuery specification, must not include a charset attribute. Optional.

%output:encoding

Declares the file character set. Optional.

Note:

The %output:media-type and %output:encoding annotations specify the content type and encoding when the $contentType parameter is not included in the signature.

Parameters

$uris as xs:string*

Lists the HDFS file URIs. Required.

$contentType as xs:string?

The file content type. It may have the charset attribute.

Returns

Returns document-node()*, where each document node has two root elements. See "Tika Parser Output Format".
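
For illustration only, the following sketch declares a custom collection function that fixes the content type with the %output:media-type annotation instead of a $contentType parameter. The function name, file pattern, and use of text:put from the oxh:text module are assumptions, not part of the adapter.

import module "oxh:text";

declare
   %tika:collection
   %output:media-type("application/pdf")
function local:parse-reports($uris as xs:string*) as document-node()* external;

(: Parse the matching PDF files and write the plain-text body of each one. :)
for $doc in local:parse-reports("/user/oxh/reports/*.pdf")
return
   text:put(string($doc//*:body[1]))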

Tika Parser Output Format

The result of Tika parsing is a document node with two root elements:

  • Root element #1 is the XHTML content produced by Tika.

  • Root element #2 is the document metadata extracted by Tika.

The root elements have the following formats:

Root element #1

<html xmlns="http://www.w3.org/1999/xhtml">
...textual content of Tika HTML...
</html>

Root element #2

<tika:metadata xmlns:tika="oxh:tika">
   <tika:property name="NAME_1">VALUE_1</tika:property>
   <tika:property name="NAME_2">VALUE_2</tika:property>
</tika:metadata>
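
To show how the two root elements can be addressed in a query, here is a small sketch that lists the metadata properties from root element #2 of each parsed document. The file pattern and the use of text:put (oxh:text module) are illustrative assumptions; the available property names depend on the parsed files.

import module "oxh:tika";
import module "oxh:text";

(: Select the metadata root element and write each property as name=value. :)
for $doc in tika:collection("/user/oxh/docs/*")
for $prop in $doc/*:metadata/*:property
return
   text:put(concat($prop/@name, "=", string($prop)))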

Tika Adapter Configuration Properties

The following Hadoop properties control the behavior of the Tika adapter:

oracle.hadoop.xquery.tika.html.asis

Type: Boolean

Default Value: false

Description: When this property is set to true, all HTML elements are omitted during parsing. When it is set to false, only the safe elements are omitted during parsing.

oracle.hadoop.xquery.tika.locale

Type: Comma-separated list of strings

Default Value: Not defined

Description: Defines the locale to be used by some Tika parsers, such as the Microsoft Office document parser. Only three strings are allowed: language, country, and variant; country and variant are optional. When the locale is not defined, the system locale is used. When the strings are defined, they must correspond to the java.util.Locale specification format described at http://docs.oracle.com/javase/7/docs/api/java/util/Locale.html, and the locale is constructed as follows:

  • If only language is specified, then the locale is constructed from the language.

  • If the language and country are specified, then the locale is constructed from both language and country

  • If language, country, and variant are specified, then the locale is constructed from language, country, and variant.
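
As one possible sketch of how these properties might be supplied, the following Hadoop configuration entries set both of them. The values shown, and the assumption that your OXH invocation reads them from a configuration file (or equivalent -D generic options), are illustrative.

<configuration>
   <!-- Controls which HTML elements are omitted during parsing; see the
        description of oracle.hadoop.xquery.tika.html.asis above. -->
   <property>
      <name>oracle.hadoop.xquery.tika.html.asis</name>
      <value>false</value>
   </property>
   <!-- Language and country strings for locale-sensitive Tika parsers,
        in java.util.Locale terms (variant omitted). -->
   <property>
      <name>oracle.hadoop.xquery.tika.locale</name>
      <value>fr,FR</value>
   </property>
</configuration>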

Examples of Tika File Adapter Functions

Example 1   Using Built-in Functions to Index PDF documents with Cloudera Search

This example query uses Tika to parse PDF files into HTML form and then adds the HTML documents to Solr's full-text index. The source documents are the HDFS files matching the following pattern:

*bigdata*.pdf

The following query indexes the HDFS files:

import module "oxh:tika";
import module "oxh:solr";
 
for $doc in tika:collection("*bigdata*.pdf")
let $docid := data($doc//*:meta[@name eq "resourceName"]/@content)[1]
let $body := $doc//*:body[1]
return
   solr:put(
        <doc> 
            <field name="id">{ $docid }</field>
            <field name="text">{ string($body) }</field>
            <field name="content">{ serialize($doc/*:html) }</field>
         </doc> 
   )
 

The HTML representation of the documents is added to the Solr index, and they become searchable. Each document ID in the index is the file name.

Example 2   Using Built-in Functions to Index HTML documents with Cloudera Search

This example query uses Tika to parse HTML documents stored in sequence files, where the key is a URL and the value is the HTML content.

import module "oxh:tika";
import module "oxh:solr";
import module "oxh:seq";

for $doc in seq:collection-tika("/path/to/seq/files/*")
let $docid := document-uri($doc)
let $body := $doc//*:body[1]
return
   solr:put(
      <doc>
         <field name="id">{ $docid }</field>
         <field name="text">{ string($body) }</field>
         <field name="content">{ serialize($doc/*:html) }</field>
      </doc>
   )

The HTML representation of the documents is added to the Solr index, and they become searchable. Each document ID in the index is the document URI, which in this example is the URL stored as the sequence file key.